
Run ocrd network sample #449

Draft · wants to merge 16 commits into master


Conversation

joschrew (Contributor) commented Aug 20, 2024

This PR showcases the usage of ocrd network.

Step-by-step guide to run the example (a condensed session transcript follows the list):

  • get a copy of this PR / repo and cd into it
  • python or python3 with python-click is needed (e.g. sudo apt install python3-click)
  • create a directory for the workspaces and the ocrd-resources, which will be mounted into the containers: mkdir -p /tmp/mydata/ocrd-resources. This can be configured later. The folders must be owned by the current user.
  • copy a workspace to process to /tmp/mydata. The workspace I use ends up at /tmp/mydata/vd18test/mets.xml, together with its images in the DEFAULT filegroup. If this name differs, the workflow script in the next step has to be adjusted.
  • run make network-setup (or, with a specific Python version, make network-setup PYTHON=python3.9) to create the required files and a venv that can be used as a client
  • run make network-start to start the docker containers (make network-stop and make network-clean to tear down)
  • create the workflow script in file workflow.txt:
tesserocr-recognize -I DEFAULT -O PAGE -P segmentation_level region -P textequiv_level word -P find_tables true -P model Fraktur
  • activate the venv: . run-network/venv/bin/activate and run the workflow with: ocrd-process -m /data/vd18test/mets.xml -w workflow.txt
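
Putting the steps together, a condensed session might look like this (a sketch only, assuming the workspace has already been copied to /tmp/mydata/vd18test):

mkdir -p /tmp/mydata/ocrd-resources
make network-setup PYTHON=python3
make network-start
echo "tesserocr-recognize -I DEFAULT -O PAGE -P segmentation_level region -P textequiv_level word -P find_tables true -P model Fraktur" > workflow.txt
. run-network/venv/bin/activate
ocrd-process -m /data/vd18test/mets.xml -w workflow.txt
make network-stop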

stweil (Collaborator) commented Aug 20, 2024

Are there already other showcases and documents for ocrd network, or even complete installations? I recently started my own first experiments with it (based on a native OCR-D installation, no Docker), found only the documentation included in the code and in the specification, and therefore appreciate this new sample.

joschrew (Contributor, Author):

Thank you for your help.

I'm afraid I don't know of any other example deployment for ocrd network so far. I have focused on the Docker deployment, because then no native installation of processors is needed.

kba self-assigned this Aug 21, 2024
MehmedGIT commented Aug 21, 2024

Are there already other showcases and documents for ocrd network, or even complete installations? I recently started my own first experiments with it (based on a native OCR-D installation, no Docker), found only the documentation included in the code and in the specification, and therefore appreciate this new sample.

@stweil, there is also this pad with quick instructions for a native environment, which we have not been able to put under the ocrd_network docs yet: https://pad.gwdg.de/Ty6IXzhIRa6AvDdC4kTy_g#. Let me know if you need further assistance. Glad to help.

joschrew (Contributor, Author) commented Feb 6, 2025

Thank you for your input; I think I included all parts in the last commit: d3fa81f. As already noted by bertsky, this now depends on OCR-D/core#1303.

bertsky (Collaborator) left a review comment:

Getting better and better!

@@ -0,0 +1,192 @@
dest: docker-compose.yml
Collaborator:

BTW, we could also split up the docker-compose.yml into a fixed (template) part and then include the generated configs. (So for example, we could include docker-compose.servers.yml and docker-compose.processors.yml.)
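
For illustration, a sketch of that split, using the top-level include element available in recent Docker Compose versions (file names as proposed above; the comments are assumptions about what each generated part would contain):

# docker-compose.yml -- fixed template part (sketch)
include:
  - docker-compose.servers.yml     # generated: server-side services
  - docker-compose.processors.yml  # generated: one service per processor image
# plus any fixed services/volumes/networks shared by both parts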

bertsky (Collaborator) commented Feb 7, 2025

We now have to start thinking about how to generate ocrd-all-tool.json (and the newly introduced ocrd-all-config.yaml, which could in principle itself be generated from the former, if we also included the Dockerhub image name via a dockerhub key in the ocrd-tool.json schema and introduced a central configuration file mapping modules/subrepos to profiles, information which used to be just part of the docker rules).

So far, we required a git checkout of all (enabled) submodules (i.e. OCRD_MODULES), and then ran ocrd-all-tool.py to concatenate all the individual */ocrd-tool.json files. But for the network setting, this seems too expensive: we already fetch all the module Docker images anyway, so perhaps we should instead docker cp each ocrd-tool.json from the images into the CWD. But that in turn requires a list of Dockerhub image names for all (enabled) modules in advance...
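
A sketch of what the docker cp route could look like (the module list and the in-image path of each ocrd-tool.json are assumptions; the path varies per module):

for image in ocrd/tesserocr ocrd/cis; do
    cid=$(docker create "$image")   # docker cp needs a container, not an image
    docker cp "$cid":/build/ocrd-tool.json "$(basename "$image").ocrd-tool.json"
    docker rm "$cid"
done
# the per-module files would then still have to be merged into ocrd-all-tool.json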

Comment on lines +258 to +259
volumes:
- "${{DATA_DIR_HOST}}:/data"
Collaborator:

@joschrew what about the processor resources, shouldn't that be a volume, too?

Preferably as named volume (so module-distributed and user-downloaded files can mix freely), e.g.:

Suggested change
-  volumes:
-  - "${{DATA_DIR_HOST}}:/data"
+  volumes:
+  - "${{DATA_DIR_HOST}}:/data"
+  - ocrd-resources:/usr/local/share/ocrd-resources

Unfortunately, it seems we missed the opportunity of the latest sweep across modules to define a unique internal resource location. In ocrd/tesserocr, we already use /models as an alias for /usr/local/share/ocrd-resources, but in many other images we still use the latter...

Collaborator:

Perhaps brevity (/models instead of /usr/local/share/ocrd-resources) is not an issue if we are using scripted mounts and calls anyway. And /models was a hack of sorts: it was not backed by the spec (which still says /usr/local is the system location, and $XDG_DATA_HOME the data location), but was only a convention for our fat container images.

Therefore I will switch ocrd/tesserocr back to /usr/local/share/ocrd-resources as the only place to mount (as in all the other workable slim container images).

Collaborator:

Therefore I will switch ocrd/tesserocr back to /usr/local/share/ocrd-resources as the only place to mount (as in all the other workable slim container images).

OCR-D/ocrd_tesserocr@70caca0

joschrew (Contributor, Author):

I have added a resource volume now. What would you suggest to get the resources into the volume? I tried using the resmgr, which fails because of a missing write permission on /.config (the setup cannot use root permissions). Would it be a good idea to do this via the Makefile?

Collaborator:

What would you suggest to get the resources into the volume? I tried using the resmgr, which fails because of a missing write permission on /.config (the setup cannot use root permissions). Would it be a good idea to do this via the Makefile?

Ah, that bit us before in the fat containers, so I thought we found a remedy for it: setting XDG_CONFIG_HOME=/usr/local/share/ocrd-resources as well, so the volume should cover both the resources themselves and the config file ocrd/resources.yml.

Your /.config indicates that XDG_CONFIG_HOME is not set (as HOME=/). What image is this happening with?

So no, this should work on the user side already.
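
For illustration, an invocation along those lines (a sketch; the image and resource name are taken from the test example further down this thread):

docker run --rm -u $(id -u):$(id -g) \
  -e XDG_CONFIG_HOME=/usr/local/share/ocrd-resources \
  -v ocrd-resources:/usr/local/share/ocrd-resources \
  ocrd/tesserocr ocrd resmgr download ocrd-tesserocr-recognize fra.traineddata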

joschrew (Contributor, Author) commented Apr 2, 2025

All images. The setup is started as the current user; this is necessary because the created workspace files should be owned by that user. If the volume has root permissions (and a named volume has the same), the resource download can only be triggered by root. But when the container is started as the user and root is then used to download the resources, we get logfile permission errors again. The value of XDG_CONFIG_HOME has no effect on permissions.

Collaborator:

Oh, right! We do have a problem with named volumes. They are always created as root. Even with docker volume create. It seems the only way to get normal ownership is by choosing a container-side parent path which already has the permissions we want (as these get inherited by the named volume). But in cases where the directory itself already exists (from the Docker build), its ownership won't change (i.e. stay root). That is a big problem.

The value of XDG_CONFIG_HOME has no effect on permissions.

It does have the effect of controlling where ocrd/resources.yml gets written.
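
A minimal repro of the ownership problem (sketch; alpine only serves as a stand-in image):

docker volume create testvol
docker run --rm -u $(id -u) -v testvol:/data alpine touch /data/probe
# expected to fail with "Permission denied": the volume root is created owned by root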

Collaborator:

See moby/moby#3124

Okay, so the only way out AFAICS is setting /usr/local/share/ocrd-resources (and its subdirectories) to 777 in the Dockerfile already. This means in every Dockerfile.

You can test this:

  • revert a9f2ffd
  • write a Dockerfile.fixup, contents:
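# fixup layer (sketch): make the resource location world-writable so a non-root user can populate the named volume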
ARG BASE_IMAGE
FROM $BASE_IMAGE
RUN mkdir -p /usr/local/share/ocrd-resources
RUN find /usr/local/share/ocrd-resources -type d -exec chmod 777 {} ";"
RUN find /usr/local/share/ocrd-resources -type f -exec chmod 666 {} ";"
  • use it to patch a module of your choice, e.g.:
cat Dockerfile.fixup | docker build --build-arg BASE_IMAGE=ocrd/tesserocr -t ocrd/tesserocr:permissions -
  • use the patched image, e.g.:
docker run --rm -it -v ocrd-resources:/usr/local/share/ocrd-resources ocrd/tesserocr:permissions ocrd resmgr download ocrd-tesserocr-recognize fra.traineddata

joschrew (Contributor, Author) commented Apr 7, 2025

I tried your suggestion; it nearly works for me (setting /usr/local/share/ocrd-resources itself to 777 is also necessary).
Another possibility would be to have more than one folder for the resources (kind of like what we have currently). /usr/local/share/ocrd-resources could be used for preinstalled resources, and the "on-demand" resources could be installed to another folder (XDG_DATA_HOME/ocrd-resources, for example), for which a host-mounted volume could be used. I'd prefer one folder for all resources, but setting global permissions might raise security concerns again, like it did when we talked about setting permissions for the logging folders. Did we have a reason why we don't want to use XDG_DATA_HOME/ocrd-resources?

Edit: the suggestion works as expected; I had forgotten to reset (delete) the mounted named volume.

Collaborator:

it nearly works for me (setting /usr/local/share/ocrd-resources itself to 777 is also necessary).

Are you sure you tried the exact command above (which should already cover the top-level directory as well)?

Another possibility would be to have more than one folder for the resources [...]
Did we have a reason why we don't want to use XDG_DATA_HOME/ocrd-resources?

Yes, we did have a reason. Not all processors can handle more than one location. Tesseract, for example, cannot be made to look up resources in more than one directory.

It's also not intuitive. You don't want to have to think about resources in terms of where they came from.

joschrew (Contributor, Author) commented Apr 2, 2025

We now have to start thinking about how to generate ocrd-all-tool.json (and the newly introduced ocrd-all-config.yaml, which could in principle itself be generated from the former, if we also included the Dockerhub image name via a dockerhub key in the ocrd-tool.json schema and introduced a central configuration file mapping modules/subrepos to profiles, information which used to be just part of the docker rules).

So far, we required a git checkout of all (enabled) submodules (i.e. OCRD_MODULES), and then ran ocrd-all-tool.py to concatenate all the individual */ocrd-tool.json files. But for the network setting, this seems too expensive: we already fetch all the module Docker images anyway, so perhaps we should instead docker cp each ocrd-tool.json from the images into the CWD. But that in turn requires a list of Dockerhub image names for all (enabled) modules in advance...

This is still an open issue. Currently a copied ocrd-all-tool.json is simply used, but maybe this is the best way for now.

bertsky mentioned this pull request Apr 29, 2025